Average word length | # of sentences | Source |
---|---|---|
5.80 | 16 | Abecedarium |
5.89 | 11 | Scrugulus Anas |
6.03 | 28 | Requiem |
6.04 | 24 | Numerus |
6.07 | 11 | Carmen Possum |
6.10 | 19 | Asterix |
6.25 | 12 | Adrochatio |
6.25 | 18 | Lingua Gullah |
6.25 | 14 | Breviarium Pii V |
6.27 | 13 | Alea |
6.27 | 11 | Index paparum |
6.29 | 13 | Tullus Hostilius |
6.31 | 21 | Index consulum rei publicae |
6.32 | 67 | Ars amandi |
6.34 | 15 | Rex (scacci) |
6.35 | 11 | Decalogus |
6.35 | 11 | Veni Sancte Spiritus |
6.35 | 11 | Defectio solis |
6.37 | 29 | Ventus |
6.40 | 10 | Epicurus |
6.41 | 11 | Veni Redemptor Gentium |
6.43 | 21 | Via 66 |
6.43 | 11 | Tria Regna Coreae |
6.43 | 11 | Carmina Gulielmi Shakespeare |
6.44 | 11 | Publius Scipio Nasica |
6.45 | 37 | Fors |
6.46 | 11 | Nicolaus Myrensis |
6.49 | 10 | Annuntiatio |
6.49 | 22 | Scacci |
6.49 | 70 | Latino sine Flexione |

Average word length | # of sentences | Source |
---|---|---|
20.69 | 50 | Divisio administrativa Russiae |
9.32 | 10 | Reccesvinthus, rex Visigothorum |
9.05 | 11 | Sovieti |
8.95 | 12 | Vitalius Bianchi |
8.85 | 20 | Eugenius Polivanov |
8.83 | 12 | Sergius Eisenstein |
8.82 | 41 | Linguarum officialium catalogus |
8.71 | 14 | Unio Rerum Publicarum Socialisticarum Sovieticarum |
8.70 | 12 | Philosophia |
8.47 | 13 | Fraternitas Sacerdotalis Sancti Petri |
8.43 | 14 | Nicolaus Gogol |
8.34 | 15 | Administratio regni Visigothorum |
8.33 | 14 | Vladimirus Bechterev |
8.32 | 15 | Italica Secunda Res Publica |
8.32 | 14 | Officiales Unionis Europaeae linguae |
8.30 | 29 | Vladimirus Lenin |
8.24 | 11 | Ioannes Baudouin de Courtenay |
8.23 | 18 | Nicolaus Cazantzaces |
8.23 | 13 | Georgius Ratzinger |
8.20 | 14 | Michael Lermontov |
8.17 | 14 | Philippus II (Macedonum rex) |
8.16 | 32 | Alexander Puskin |
8.16 | 28 | Elias Ehrenburg |
8.14 | 21 | Ioannes Bunin |
8.14 | 34 | Australia |
8.12 | 14 | Alexander Griboiedov |
8.09 | 10 | Ruthenia |
8.08 | 11 | Leges motus quanticae |
8.05 | 16 | Lingua Legionica |
8.05 | 14 | Vasingtonia, C.C. |
The problem addressed in this subsection, and the results, are similar to those of 6.4.1.1, but the focus here is on average word length rather than average sentence length.
Measuring average word length depends strongly on tokenization. A typical tokenizer might split the string “28.06.2005” into five tokens, “28 . 06 . 2005”, with an average length of two characters. To avoid this, the number of words in a sentence is counted as 1 + (number of blanks in the sentence).
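The effect of the two counting rules can be illustrated with a minimal Python sketch. The function names are hypothetical, and `naive_tokens` is only an assumed stand-in for a punctuation-splitting tokenizer, not the tokenizer actually used for the corpus. Note that the blank-based average divides the full sentence length by the word count, as the query in this subsection does, so blanks and punctuation contribute to the result.

```python
import re


def naive_tokens(text):
    # Assumed stand-in for a "usual" tokenizer: runs of word characters,
    # or single non-space symbols such as "." (hypothetical helper).
    return re.findall(r"\w+|[^\w\s]", text)


def blank_based_word_count(sentence):
    # Word count as defined in the text: 1 + number of blanks.
    return 1 + sentence.count(" ")


def blank_based_avg_word_length(sentence):
    # Full sentence length (blanks included) divided by the blank-based word count.
    return len(sentence) / blank_based_word_count(sentence)


date = "28.06.2005"
tokens = naive_tokens(date)
naive_avg = sum(len(t) for t in tokens) / len(tokens)

print(tokens, naive_avg)                    # five tokens, average length 2.0
print(blank_based_word_count(date),         # 1 word
      blank_based_avg_word_length(date))    # length 10.0
```

With the blank-based rule the date counts as a single word of length ten, so dates and similar strings no longer pull the average down.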
select round(avg(length(sentence) / (1 + length(sentence) - length(replace(sentence, " ", "")))), 2) as le,
  count(sentence) as cnt,
  source
from sentences s, inv_so i, sources so
where s.s_id = i.s_id and i.so_id = so.so_id
group by source
having cnt >= 10
order by le
limit 30;
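The query can be tried on toy data with the following sketch, which uses Python's built-in `sqlite3` module. Several things here are assumptions rather than part of the original setup: the table layouts are inferred only from the column names in the query, the data is invented for illustration, and the SQL is adapted to SQLite (an explicit cast to real, because SQLite divides integers without a fractional part, and single-quoted string literals). The `longest_first` flag is also an assumption: it sorts in descending order, which would match the ordering of the table of longest average word lengths, but the original text shows only the ascending query.

```python
import sqlite3

# The query from the text, adapted to SQLite (see the assumptions above).
QUERY = """
select round(avg(cast(length(sentence) as real)
                 / (1 + length(sentence) - length(replace(sentence, ' ', '')))), 2) as le,
       count(sentence) as cnt,
       source
from sentences s, inv_so i, sources so
where s.s_id = i.s_id and i.so_id = so.so_id
group by source
having count(sentence) >= ?
order by le {direction}
limit 30
"""


def word_length_ranking(conn, min_sentences=10, longest_first=False):
    # Hypothetical helper: returns (average length, sentence count, source) rows.
    direction = "desc" if longest_first else "asc"
    return conn.execute(QUERY.format(direction=direction), (min_sentences,)).fetchall()


# Toy in-memory database; the schema is an assumption inferred from the query.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table sentences (s_id integer primary key, sentence text);
create table sources (so_id integer primary key, source text);
create table inv_so (s_id integer, so_id integer);
""")
conn.executemany("insert into sentences values (?, ?)",
                 [(1, "a bc"), (2, "abcd ef"), (3, "28.06.2005")])
conn.executemany("insert into sources values (?, ?)", [(1, "Short"), (2, "Long")])
conn.executemany("insert into inv_so values (?, ?)", [(1, 1), (2, 1), (3, 2)])

# The threshold is lowered to 1 because the toy sources hold only a few sentences.
shortest = word_length_ranking(conn, min_sentences=1)
longest = word_length_ranking(conn, min_sentences=1, longest_first=True)
print(shortest)
print(longest)
```

The average is taken over per-sentence ratios, so every sentence of a source carries the same weight regardless of its length.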
6.4.2.2 Average logarithmic word rank for different sources
6.4.2.3 Sources consisting of many / few words with frequency 1
6.4.2.4 Sources with low / high average word length of rare words